NSF PAR Search | NSF Public Access Repository

HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model

Vo, Khoa; Phan, Thinh; Yamazaki, Kashu; Tran, Minh; Le, Ngan (December 2024, Curran Associates, Inc.)

Full Text Available
AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation

https://doi.org/10.3390/rs16162930

Hanyu, Taisei; Yamazaki, Kashu; Tran, Minh; McCann, Roy A; Liao, Haitao; Rainwater, Chase; Adkins, Meredith; Cothren, Jackson; Le, Ngan (August 2024, Remote Sensing)

When performing remote sensing image segmentation, practitioners often encounter various challenges, such as a strong foreground–background imbalance, the presence of tiny objects, high object density, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this paper introduces AerialFormer, a hybrid model that combines the strengths of Transformers and Convolutional Neural Networks (CNNs). AerialFormer features a CNN Stem module that preserves low-level, high-resolution features, enhancing the model’s capability to process the details of aerial imagery. AerialFormer is designed with a hierarchical structure, in which a Transformer encoder generates multi-scale features and a multi-dilated CNN (MDC) decoder aggregates the information from the multi-scale inputs. As a result, information is taken into account in both local and global contexts, so that powerful representations and high-resolution segmentation can be achieved. AerialFormer was evaluated on three benchmark datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that the proposed AerialFormer remarkably outperforms state-of-the-art methods.
Full Text Available
Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

https://doi.org/10.1109/ICRA57147.2024.10610193

Yamazaki, Kashu; Hanyu, Taisei; Vo, Khoa; Pham, Thang; Tran, Minh; Doretto, Gianfranco; Nguyen, Anh; Le, Ngan (May 2024, IEEE)

Full Text Available
CLIP-TSA: Clip-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

https://doi.org/10.1109/ICIP49359.2023.10222289

Joo, Hyekang Kevin; Vo, Khoa; Yamazaki, Kashu; Le, Ngan (October 2023, IEEE)

Video anomaly detection (VAD), commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature, is a challenging problem in video surveillance where the anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. The extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
Full Text Available
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

https://doi.org/10.1609/aaai.v37i3.25412

Yamazaki, Kashu; Vo, Khoa; Truong, Quang Sang; Raj, Bhiksha; Le, Ngan (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
Full Text Available
DNA: Deformable Neural Articulations Network for Template-free Dynamic 3D Human Reconstruction from Monocular RGB-D Video

https://doi.org/10.1109/CVPRW59228.2023.00375

Vo, Khoa; Pham, Trong-Thang; Yamazaki, Kashu; Tran, Minh; Le, Ngan (June 2023, IEEE)

In this paper, we present a novel Deformable Neural Articulations Network (DNA-Net), which is a template-free learning-based method for dynamic 3D human reconstruction from a single RGB-D sequence. Our proposed DNA-Net includes a Neural Articulation Prediction Network (NAP-Net), which is capable of representing non-rigid motions of a human by learning to predict a set of articulated bones to follow movements of the human in the input sequence. Moreover, DNA-Net also includes a Signed Distance Field Network (SDF-Net) and an Appearance Network (Color-Net), which take advantage of the powerful neural implicit functions in modeling 3D geometries and appearance. Finally, to avoid the reliance on external optical flow estimators to obtain deformation cues like previous related works, we propose a novel training loss, namely Easy-to-Hard Geometric-based, which is a simple strategy that inherits the merits of Chamfer distance to achieve good deformation guidance while still avoiding its sensitivity to local mismatches. DNA-Net is trained end-to-end in a self-supervised manner directly on the input sequence to obtain 3D reconstructions of the input objects. Quantitative results on videos of the DeepDeform dataset show that DNA-Net outperforms related state-of-the-art methods by an adequate margin; qualitative results additionally show that our method can reconstruct human shapes with high fidelity and detail.
Full Text Available
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

https://doi.org/10.1007/s11263-022-01702-9

Vo, Khoa; Truong, Sang; Yamazaki, Kashu; Raj, Bhiksha; Tran, Minh-Triet; Le, Ngan (January 2023, International Journal of Computer Vision)

Full Text Available
Contextual Explainable Video Representation: Human Perception-based Understanding

https://doi.org/10.1109/IEEECONF56349.2022.10052051

Vo, Khoa; Yamazaki, Kashu; Nguyen, Phong X.; Nguyen, Phat; Luu, Khoa; Le, Ngan (October 2022, 2022 56th Asilomar Conference on Signals, Systems, and Computers)

Full Text Available
Spiking Neural Networks and Their Applications: A Review

https://doi.org/10.3390/brainsci12070863

Yamazaki, Kashu; Vo-Ho, Viet-Khoa; Bulsara, Darshan; Le, Ngan (July 2022, Brain Sciences)

The past decade has witnessed the great success of deep neural networks in various domains. However, deep neural networks are very resource-intensive in terms of energy consumption, data requirements, and computational cost. With the recent increasing need for the autonomy of machines in the real world, e.g., self-driving vehicles, drones, and collaborative robots, exploitation of deep neural networks in those applications has been actively investigated. In those applications, energy and computational efficiencies are especially important because of the need for real-time responses and the limited energy supply. A promising solution to these previously infeasible applications has recently been given by biologically plausible spiking neural networks. Spiking neural networks aim to bridge the gap between neuroscience and machine learning, using biologically realistic models of neurons to carry out the computation. Due to their functional similarity to the biological neural network, spiking neural networks can embrace the sparsity found in biology and are highly compatible with temporal code. Our contributions in this work are: (i) we give a comprehensive review of theories of biological neurons; (ii) we present various existing spike-based neuron models, which have been studied in neuroscience; (iii) we detail synapse models; (iv) we provide a review of artificial neural networks; (v) we provide detailed guidance on how to train spike-based neuron models; (vi) we review available spike-based neuron frameworks that have been developed to support implementing spiking neural networks; (vii) finally, we cover existing spiking neural network applications in computer vision and robotics domains. The paper concludes with discussions of future perspectives.
Full Text Available
VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

https://doi.org/10.1109/ICIP46576.2022.9897766

Yamazaki, Kashu; Truong, Sang; Vo, Khoa; Kidd, Michael; Rainwater, Chase; Luu, Khoa; Le, Ngan (October 2022, 2022 IEEE International Conference on Image Processing (ICIP))

Full Text Available
